[Feature] Add zero bubble for spec v2 #21895
litmei wants to merge 48 commits into sgl-project:main
Motivation
In the current SGLang framework, the EAGLE3 Spec V2 implementation suffers from a CPU-side scheduling bottleneck. Specifically, the CPU dispatch between consecutive decode steps is bound by the draft model's overhead, creating execution bubbles that the overlap scheduler cannot effectively hide. The goal of this PR is to refactor the scheduling logic to minimize or completely eliminate these CPU-originated bubbles.
Modifications
- **Asynchronous Data Transfer**: Based on profiling results, all identified `to("cpu")` operations have been made asynchronous via `.pin_memory().to("cpu", non_blocking=True)` to reduce synchronization stalls. See also PR #21360: "Use pin_memory in forward_batch.init_new to reduce decoding latency".
- **Scheduling Refactor & Draft Pre-execution**:
  - Added `prepare_for_verify`, which handles input construction for verification or output restoration from the previous round.
  - Stopped updating `ForwardBatch.seq_lens_cpu` during the drafting phase. This is safe for models such as DeepSeek-V3.2, which do not rely on `seq_lens_cpu` during the decode stage. For models like Qwen3 that do require these lengths, this change may affect the accepted length (causing it to fluctuate higher or lower).
  - Rolled back PR #21507, since the native implementation leads to significant degradation in MTP scenarios.
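For reference, the asynchronous device-to-host transfer pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the helper name `async_to_cpu` is hypothetical, and the standard idiom of copying into a pre-pinned host buffer is used in place of the exact call chain in `forward_batch_info.py`:

```python
import torch

def async_to_cpu(t: torch.Tensor) -> torch.Tensor:
    """Copy a tensor to host memory without forcing a hard device sync.

    Hypothetical helper for illustration only. Page-locked (pinned) host
    memory lets the D2H DMA copy run asynchronously on the current CUDA
    stream; the caller must synchronize before reading the result.
    """
    # Pinning host memory requires a CUDA context, so fall back to a
    # plain pageable buffer on CPU-only machines.
    pin = torch.cuda.is_available()
    out = torch.empty(t.shape, dtype=t.dtype, pin_memory=pin)
    # With a CUDA source and a pinned destination, this copy is async.
    out.copy_(t, non_blocking=True)
    return out
```

The key point is that the scheduler thread is not blocked waiting on the copy; the synchronization cost is paid only at the point where the CPU actually consumes the values.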
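The zero-bubble scheduling idea behind the refactor can be sketched with a toy pipeline: while the device executes verification for step *i*, the CPU prepares the inputs for step *i+1*, so draft-side CPU overhead is hidden behind device work. All names below (`overlapped_decode` and the `draft`/`verify`/`prepare_for_verify` callables) are illustrative stand-ins, not SGLang's actual API:

```python
import concurrent.futures

def overlapped_decode(num_steps, draft, verify, prepare_for_verify):
    """Toy sketch of overlap scheduling (not SGLang's implementation).

    A single worker thread stands in for the device stream: verification
    for the previous step runs concurrently with CPU-side input
    preparation for the next step, eliminating the dispatch bubble.
    """
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for step in range(num_steps):
            inputs = prepare_for_verify(step)       # CPU work for this round
            if pending is not None:
                results.append(pending.result())    # collect previous verify
            # Launch draft + verify for this round; the loop immediately
            # proceeds to prepare the next round on the CPU.
            pending = pool.submit(verify, draft(inputs))
        results.append(pending.result())            # drain the last step
    return results
```

In the real scheduler the "worker" is a CUDA stream rather than a thread, but the structure is the same: the CPU never idles waiting for a verify to finish before it starts constructing the next batch.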
Accuracy Tests
Theoretically, for DSA models this feature has no impact on accuracy or accepted length, as seen with DeepSeek-V3.2:
We need to investigate why DeepSeek-V3.2 didn't yield any gains here.
Before:
After:
Speed Tests and Profiling
TODO
H20 dsv3.2 (layer pruning) profile
Before:
After:
Ascend A3 dsv3.2-w8a8 profile
Before:
After:
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci